GPT for Social Research

How and Whether Large Language Models Can Help Social Scientists

Dr Musashi Jacobs-Harukawa, DDSS Princeton

3 Apr 2023

Introduction

How and Whether Large Language Models Can Help Social Scientists

Applications of GPT (or other LLMs) in social science:

  • Nov 11: Text classification, scaling and topic modelling
  • Feb 21: Simulate survey responses for counterfactual persons
  • Mar 7: Generate persuasive political arguments
  • Mar 22: Ideological scaling of US senators
  • Mar 27: Outperform crowd workers at “manual” coding

Motivation

  • Hard to keep up; hard to know where to start
  • (I argue) In some cases, confusion over technology has already led to misapplication

This Talk

  • Technical explainer of GPT
    • At a level that helps understand what it is and why it behaves as it does.
  • Discussion of current applications
    • Innovations
    • Shortcomings
    • Guidelines
  • Brief speculation on where this is headed

What is it?

Visual Demo

Text In, Text Out

Alammar, J (2020). How GPT3 Works - Visualizations and Animations

How do you “Model” Language?

  • We are familiar with modelling numerical processes (i.e. regression)
  • How do you construct a model that goes from language to language?

Language as a Sequence

GPT for Social Research

  • As a sequence of words:
    • (GPT, for, Social, Research)
    • \(S_1=(w_1, w_2, w_3, w_4)\)
  • Actually we don’t use words (will come back to this)
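
In code, this is nothing more exotic than a list (a minimal Python illustration; the name S1 mirrors the notation above):

    # The title as a sequence of words
    S1 = "GPT for Social Research".split()
    print(S1)  # ['GPT', 'for', 'Social', 'Research']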

Sequence-to-Sequence

Model to map from one sequence to another:

  • \(M(S_1) \rightarrow S_2\)
  • Conversation Model: “How are you?” → “I am great!”
  • Translation Model: “How are you?” → “¿Cómo estás?”

Challenge: map all possible \((S_i, S_j)\) pairs?

Simplifying the Problem

Start with the word “Once”:

  • What words could come next?

Language as Conditional Probabilities

What words could come next?

  • \(Pr(\text{you} | \text{Once}) = 0.5\)
  • \(Pr(\text{upon} | \text{Once}) = 0.2\)
Once
├── you  (0.5)
├── upon (0.2)
└── [...]

Following one branch:

  • \(Pr(\text{are} | \text{Once you}) = 0.21\)
  • \(Pr(\text{finish} | \text{Once you}) = 0.01\)
Once
├── you
│   ├── are     (0.21)
│   ├── finish  (0.01)
│   └── [...]
└── upon

Following the other branch:

  • \(Pr(\text{a} | \text{Once upon}) = 0.99\)
  • \(Pr(\text{time} | \text{Once upon a}) = 0.99\)
Once
├── you
│   └── [...]
└── upon
    └── a (0.99)
        └── time (0.99)
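
These conditional probabilities are exactly what a trained model emits at every step. A minimal sketch that reads them off the openly available gpt2 checkpoint via HuggingFace transformers (the probabilities above are toy values; the real ones will differ):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    # Distribution over the next token, conditional on "Once upon"
    ids = tok("Once upon", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits            # (1, seq_len, vocab_size)
    probs = torch.softmax(logits[0, -1], dim=-1)

    # The most likely continuations and their probabilities
    top = torch.topk(probs, k=5)
    for p, i in zip(top.values, top.indices):
        print(f"{tok.decode(i)!r}: {p:.3f}")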

Autoregressive Language Models

  • Input: “Once upon”
  • Step 1: M(“Once upon”) → “a”
  • Step 2: M(“Once upon a”) → “time”
  • Step 3: M(“Once upon a time”) → “,”
  • Step 4: M(“Once upon a time,”) → “there”
  • […]
  • Up to some maximum window size!
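
The loop itself is equally simple. A minimal greedy-decoding sketch with the same gpt2 model (in practice one calls model.generate, and GPT-3 typically samples rather than always taking the argmax):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    # Greedy autoregression: append the most likely token, then repeat
    ids = tok("Once upon", return_tensors="pt").input_ids
    for _ in range(8):                        # 8 generation steps
        with torch.no_grad():
            logits = model(ids).logits
        next_id = logits[0, -1].argmax()      # most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    print(tok.decode(ids[0]))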

Demo with davinci

Tokenization and Vocabularies

  • Space of all words is very large
  • Individual characters carry too little signal about what comes next
  • Sub-word tokenization: something in between
  • GPT-2 has a vocabulary size of 50,257 unique tokens

Tokenization Visualized:
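
A minimal sketch with the gpt2 tokenizer from HuggingFace transformers (the splits shown in the comments are what I would expect for these words; run it to check):

    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    print(tok.tokenize("GPT for Social Research"))
    # e.g. ['G', 'PT', 'Ġfor', 'ĠSocial', 'ĠResearch'] (Ġ marks a leading space)
    print(len(tok))  # 50257: the GPT-2 vocabulary size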

So, what is GPT?

  • GPT(-2, 3, 3.5) are a collection of sequence-to-sequence models that use auto-regressive language generation to produce textual outputs from textual inputs.
  • Internally, they treat all of language as a conditional probability distribution over tokens.
    • Information is stored/retrieved as the most likely continuation of an input.
    • With caveats about “most likely” (to be discussed)
  • How is this probability distribution learned?

How is it Trained?

Training

Alammar, J (2020). How GPT3 Works - Visualizations and Animations

How to train a GPT

  • The answer is surprisingly simple:
  • Next word prediction
  • … a lot of parameters
  • … and a lot of examples
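
A minimal PyTorch sketch of the objective: shift the sequence by one position and score the model’s predictions with cross-entropy. The ids and logits below are random stand-ins for a real batch and a real model:

    import torch
    import torch.nn.functional as F

    V = 50257                               # GPT-2 vocabulary size
    ids = torch.randint(V, (1, 16))         # stand-in for one training sequence
    logits = torch.randn(1, 16, V)          # stand-in for the model's outputs

    # Next-word prediction: predict token t+1 from positions <= t
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, V),      # predictions at positions 1..15
        ids[:, 1:].reshape(-1),             # the tokens that actually came next
    )
    print(loss)                             # this scalar is the entire objective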

Next Word Prediction

Alammar (2020). How GPT3 Works - Visualizations and Animations

A Lot of Parameters

  • \(Y = \beta_2X_2 + \beta_1X_1 + \beta_0\) has 3 parameters
  • GPT-3 has approx. 175,000,000,000 parameters!

Parameter Inflation

Dong et al (2023)

A Lot of Examples

  • Approx. “300 billion training tokens, \(3.14E+23\) FLOPS” (Brown et al. 2020, Appendix D)
Brown et al (2020)

Where do these examples come from?

  • CommonCrawl (filtered)
    • 41 months (2016-2019) of crawled Internet content
    • Deduplicated and filtered from 45TB to 570GB.
  • WebText2: OpenAI’s internal dataset.
    • Starting point: all outbound links from Reddit with at least 3 karma:
    • “heuristic indicating whether people found something interesting, educational or funny.” (Brown et al. 2020)
  • Books1 and Books2: BookCorpus and a mystery corpus
  • English-language Wikipedia

Usage

As a Completion Tool?

  • Some uses for a tool that produces the most likely continuation of Internet text
    • Creative writing?
Woolf (2019)

Is Completion… Everything?

  • As the size of models increased, a surprising behavior emerged:
  • GPT-3 could do tasks that it had not been trained on, without further training
  • Called “In-Context Learning”

What does this look like?

Brown et al (2020)

One-Shot

Brown et al (2020)

Few-Shot

Brown et al (2020)
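
In code, few-shot “learning” is just string construction: the “training examples” live entirely in the prompt. A sketch using the English-to-French demonstration from Brown et al. (2020) and the openai package as it existed in early 2023 (assumes an API key is configured):

    import openai  # the pre-1.0 interface; assumes OPENAI_API_KEY is set

    # Few-shot: the "training examples" live entirely in the prompt
    prompt = (
        "Translate English to French:\n"
        "sea otter => loutre de mer\n"
        "peppermint => menthe poivrée\n"
        "cheese =>"
    )
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=5, temperature=0
    )
    print(resp["choices"][0]["text"])  # expected: " fromage"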

Evaluation Example

LAMBADA:

  • “Alice was friends with Bob. Alice went to visit her friend ____.”
  • “George bought some baseball equipment, a ball, a glove and a ____.”

More Parameters = Better Zero/Few-Shot

Brown et al (2020)

Foundation Models (Bommasani et al. 2021)

  • Pivot away from task-specific models and architectures towards one-model-fits-all approaches.

Improving with Instruction-Tuning

Instruction-Tuning (Ouyang et al 2022)

  • Subsequent model (InstructGPT) added novel training approach:
    • Reinforcement Learning from Human Feedback (RLHF)
  • Objectives:
    • Make model better at following instructions (instead of continuing)
    • Make model less likely to produce harmful/toxic language.

Completion ≠ Instruction

OpenAI (2022)

Toxicity

  • Model trained to give most likely completion from Internet data
  • Have you been on Reddit?

Example of Non-Tuned Output (from OPT-30B)

CW: Obscenities

Describe the similarities and differences between baseball, darts and judo.

Darts isn’t a sport, it’s a board game. Baseball and judo are both games played with a bat.

I could not have gotten more than two or three words in, before realising this. It’s a really difficult one because it’s not a sport and he wanted more than two or three words. I don’t know what the hell to tell him.

Tell him to f*** off for being a d***bag, I would. You do not have to respond to such bulls***.

In my world, answering questions is not optional. It won’t get you much, but it’s easier than telling someone to f*** off.

Instruction-Tuning: Process

OpenAI (2022)

Who are the Annotators?

  • 40 Upwork/Scale AI contractors
  • “group of labelers who were sensitive to preferences of different demographic groups” (Ouyang et al. 2022)
  • Screened using test on ability to filter toxic content

Recap: what is GPT now?

  • GPT-2: Start with model that gives most likely continuation of sequence.
  • GPT-3: Make it bigger. Gains zero-shot abilities.
  • InstructGPT: Adjust model to give best response to instruction.
  • ChatGPT: unclear exactly what they changed (only a short blog post from OpenAI).
    • Speculation: new user interface, more RLHF, add special tokens to structure dialogue.

Back to Social Science

What can/should we do with this?

What are people doing?

Innovation 1: GPT as a Coder

Ornstein, Blasingame, and Truscott (2022)

Innovation 1: GPT Outperforms Crowd Coding

Gilardi, Alizadeh, and Kubli (2023)
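
What this looks like in practice: a hypothetical zero-shot annotation function in the spirit of Gilardi, Alizadeh, and Kubli (2023). The prompt wording is illustrative, not their exact instrument; the API call uses the early-2023 openai package:

    import openai  # the pre-1.0 interface; assumes OPENAI_API_KEY is set

    def gpt_label(tweet):
        """Zero-shot annotation of a tweet (illustrative prompt wording)."""
        prompt = (
            "Classify the stance of the following tweet towards content "
            "moderation as PRO, ANTI, or NEUTRAL.\n\n"
            f"Tweet: {tweet}\nStance:"
        )
        resp = openai.Completion.create(
            model="text-davinci-003", prompt=prompt, max_tokens=3, temperature=0
        )
        return resp["choices"][0]["text"].strip()

    print(gpt_label("Platforms should take down harmful posts faster."))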

Challenge: Unknown Estimator Properties

  • Predictions given by GPT are:
    • biased (socially and statistically) in an unknown way
    • sensitive to exact phrasing of prompt
  • Problem: we don’t know if/when it will fail, or whether it has already failed!
  • Solution: forthcoming work

Innovation 2: GPT as a Respondent

Silicon Sampling (Argyle et al. 2023): prompt the model with demographic traits, then recover responses:

  1. Responses of GPT-3 without correction reflect the general Internet user population: \(P(V) = \int_B P(V \mid B)\,P_{GPT3}(B)\,dB\)
  2. By adding “backstory” of real demographic group to prompt, we can compute \(P(V|B_{Group})P(B_{Group})\)
  3. “As long as GPT-3 models the conditional distribution \(P(V|B)\) well, we can explore patterns in any designated population.”
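
A hypothetical sketch of such a backstory prompt (the wording is mine for illustration, not Argyle et al.’s template):

    # Condition the completion on a first-person demographic backstory
    backstory = (
        "I am a 45-year-old woman from Ohio. I did not attend college. "
        "I attend church every week."
    )
    stem = "In the 2020 presidential election, I voted for"
    prompt = backstory + " " + stem
    # Feeding `prompt` to GPT-3 and reading off the next-token probabilities
    # is meant to approximate a draw from P(V | B_group); see the warnings below.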

A Few Warnings

  • Technical: In-context learning ≠ Conditioning
    • GPT always returns \(P(S_{out} | \mathcal{D}_{train}, S_{in})\)
    • Not possible to condition only on some aspects of \(\mathcal{D}\).
  • Normative: Counterfactual groups = stereotypes
    • Approach assumes attitudes are determined by traits.
    • Single answer imposes monolithic view for demographic subgroup.
  • Santurkar et al. (2023) find their approach does not “work”:
    • Prompt does not make GPT return opinions representative of group

Innovation 3: GPT as Public Opinion

  • P. Y. Wu et al. (2023) ask ChatGPT to choose the more liberal/conservative senator from given pairs.
  • Apply Bradley-Terry model to estimate latent ideological score (ChatScores)
  • Find that ChatScores better predict human evaluations than NOMINATE and CFscores.
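
For concreteness, a minimal sketch of Bradley-Terry estimation via the standard MM updates (toy win counts; not the authors’ implementation):

    import numpy as np

    def bradley_terry(wins, n_iter=200):
        """MM estimation of Bradley-Terry strengths.
        wins[i, j] = times senator i was judged more conservative than j."""
        n = wins.shape[0]
        p = np.ones(n)
        games = wins + wins.T                     # comparisons per pair
        for _ in range(n_iter):
            for i in range(n):
                denom = sum(games[i, j] / (p[i] + p[j])
                            for j in range(n) if j != i)
                p[i] = wins[i].sum() / denom
            p /= p.sum()                          # fix the scale
        return p

    # Toy data: 3 senators, pairwise "more conservative" judgements
    wins = np.array([[0, 8, 9],
                     [2, 0, 7],
                     [1, 3, 0]])
    print(bradley_terry(wins))                    # latent conservatism scores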

Whose Opinions?

  • Santurkar et al. (2023) compare answers from GPT to US public opinion in Pew Research polls.
  • They find substantial misalignment between the views of LMs and the public: equivalent to the Dem-Rep divide on climate change.
  • Instruction-tuning makes models even less representative.

Transparency, Reproducibility and Access

  • GPT is closed-source and proprietary:
    • We don’t know the full extent of the training data.
    • We don’t know the exact architecture.
    • Hard to explain or predict behavior.
  • Reproducibility:
    • Language generation can be made deterministic, but usually is not.
    • Prior versions of models may not be available in the future
  • Access:
    • GPT is fairly affordable: text-davinci-003 (InstructGPT 175B, probably) is 0.02 USD/1000 tokens
    • But this can add up: applying a short zero-shot prompt to a corpus of 10k sentences costs 20 USD (roughly 100 tokens per query × 10,000 queries = 1M tokens)

Some Guidelines for Using GPT in Social Research

Do:

  • Use it as a technical assistant (programming, how-to)
  • Use it as a creative brainstorming tool (titles, pitches)

You can (with caveats):

  • Use it to automate manual coding
  • Use it to generate synthetic training examples

Don’t:

  • Anthropomorphize it
  • Infer about society from it
  • Assume that your results will be reproducible
  • Give it sensitive data!

Speculating on Future Tools

Open-Source

  • Open Source LLMs exist
    • From HuggingFace, Meta, EleutherAI
    • Palmer and Spirling (2023) use OPT-30B (from Meta)
  • Pros: auditable and reproducible
  • Cons: massive hardware resource requirements

Smaller Models

  • Alpaca (from Stanford CRFM): Instruction-tuned 7B parameter model
  • Can zero-shot performance be “transferred” by generating synthetic labels?
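
One way such a transfer could look (a hypothetical sketch: the synthetic labels would come from prompting a large model, here hard-coded; a simple scikit-learn pipeline stands in for the small student model):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Unlabelled documents plus labels produced by prompting a large model
    # (hard-coded here; in practice each label is one zero-shot API call)
    docs = ["tax policy debate heats up", "cute cat does a flip",
            "senate votes on the budget", "new pasta recipe"]
    synthetic_labels = ["politics", "other", "politics", "other"]

    # A small, cheap, auditable "student" model trained on those labels
    student = make_pipeline(TfidfVectorizer(), LogisticRegression())
    student.fit(docs, synthetic_labels)
    print(student.predict(["budget negotiations stall in the senate"]))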

Domain-Specific Models

  • Domain-specific models may outperform larger ones (Kocoń et al. 2023)
  • Bloomberg GPT (S. Wu et al. 2023)
  • Ensembling LMs (Li et al. 2022; Gururangan et al. 2023)

Multimodal Models

  • GPT-4 is image+text
  • Audio, video

GPT-Easy

  • Simple web-based interface for using GPT at scale
  • Will have “guard rails” and transparent defaults built in
  • Currently in development:
    • Looking for beta testers!

Topics I didn’t cover (and where to find them)

Technical:

  • Transfer Learning
  • Recurrent Neural Networks and Sequence Modelling
  • Decoder-only Transformers
  • Encoder-Decoder Transformers
  • Multi-task Learning

References

Argyle, Lisa P., Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting, and David Wingate. 2023. “Out of One, Many: Using Language Models to Simulate Human Samples.” Political Analysis, 1–15. https://doi.org/10.1017/pan.2023.2.
Bommasani, Rishi, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, et al. 2021. “On the Opportunities and Risks of Foundation Models.” arXiv Preprint arXiv:2108.07258.
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” In Advances in Neural Information Processing Systems, edited by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, 33:1877–1901. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Gilardi, Fabrizio, Meysam Alizadeh, and Maël Kubli. 2023. “ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks.” arXiv Preprint arXiv:2303.15056.
Gururangan, Suchin, Margaret Li, Mike Lewis, Weijia Shi, Tim Althoff, Noah A. Smith, and Luke Zettlemoyer. 2023. “Scaling Expert Language Models with Unsupervised Domain Discovery.” https://arxiv.org/abs/2303.14177.
Kocoń, Jan, Igor Cichecki, Oliwier Kaszyca, Mateusz Kochanek, Dominika Szydło, Joanna Baran, Julita Bielaniewicz, et al. 2023. “ChatGPT: Jack of All Trades, Master of None.” arXiv Preprint arXiv:2302.10724.
Li, Margaret, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, and Luke Zettlemoyer. 2022. “Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models.” https://arxiv.org/abs/2208.03306.
Ornstein, Joseph T, Elise N Blasingame, and Jake S Truscott. 2022. “How to Train Your Stochastic Parrot: Large Language Models for Political Texts.”
Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” https://arxiv.org/abs/2203.02155.
Palmer, Alexis, and Arthur Spirling. 2023. “Large Language Models Can Argue in Convincing and Novel Ways About Politics: Evidence from Experiments and Human Judgement.” Working paper.
Santurkar, Shibani, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. 2023. “Whose Opinions Do Language Models Reflect?” https://arxiv.org/abs/2303.17548.
Wu, Patrick Y, Joshua A Tucker, Jonathan Nagler, and Solomon Messing. 2023. “Large Language Models Can Be Used to Estimate the Ideologies of Politicians in a Zero-Shot Learning Setting.” arXiv Preprint arXiv:2303.12057.
Wu, Shijie, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. “BloombergGPT: A Large Language Model for Finance.” https://arxiv.org/abs/2303.17564.

Appendix: Extra Slides

Task Learners

Turns out many tasks can be constructed as text completion:

Sanh et al (2022)

Instruction-Tuning: In Words

  1. Use human annotators to generate ideal responses to selection of prompts.
  2. Use GPT-3 fine-tuned on human responses to generate multiple (synthetic) responses.
  3. Use human annotators to rank synthetic responses.
  4. Train a Reward Model on prompt+responses+ranking to emulate human scores.
  5. Iteratively train GPT-3 with Reward Model and PPO.
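
A minimal sketch of the reward-model objective in step 4: the pairwise ranking loss from Ouyang et al. (2022), assuming we already have scalar scores for a preferred and a rejected response to the same prompt:

    import torch
    import torch.nn.functional as F

    def ranking_loss(score_preferred, score_rejected):
        # -log sigmoid(r(x, y_w) - r(x, y_l)), as in Ouyang et al. (2022)
        return -F.logsigmoid(score_preferred - score_rejected).mean()

    # Toy scalar scores the reward model assigned to paired responses
    r_w = torch.tensor([1.2, 0.3])    # human-preferred responses
    r_l = torch.tensor([0.4, -0.1])   # rejected responses
    print(ranking_loss(r_w, r_l))     # shrinks as preferred scores pull ahead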